Building Initial Monetary Models
In this workbook we investigate different ways to model the transaction amounts of individual customers, with a view to later combining this approach with a more traditional Pareto/NBD (P/NBD) model.
1 Load and Configure Datasets
We first want to load some synthetic transaction data.
customer_cohort_tbl <- read_rds("data/synthdata_allfixed_cohort_tbl.rds")
customer_cohort_tbl %>% glimpse()
## Rows: 50,000
## Columns: 4
## $ customer_id <chr> "C201901_0001", "C201901_0002", "C201901_0003", "C20190…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", …
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", …
## $ first_tnx_date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-0…
customer_simparams_tbl <- read_rds("data/synthdata_allfixed_simparams_tbl.rds")
customer_simparams_tbl %>% glimpse()
## Rows: 50,000
## Columns: 9
## $ customer_id <chr> "C201901_0001", "C201901_0002", "C201901_0003", "C2019…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1",…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01",…
## $ first_tnx_date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-…
## $ customer_mu <dbl> 0.07041543, 0.05044251, 0.04131343, 0.06001408, 0.1847…
## $ customer_tau <dbl> 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53…
## $ customer_lambda <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,…
## $ customer_nu <dbl> 0.59797122, 4.02749754, 0.19434751, 3.30778077, 1.1879…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
customer_transactions_tbl <- read_rds("data/synthdata_allfixed_transactions_tbl.rds")
customer_transactions_tbl %>% glimpse()
## Rows: 179,043
## Columns: 5
## $ customer_id <chr> "C201901_0001", "C201901_0001", "C201901_0001", "C201901…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 17:18:06, 2019-02-08 01:32:56, 2019-07-10 03…
## $ tnx_amount <dbl> 191.88, 158.80, 160.51, 157.93, 135.85, 197.42, 23.02, 2…
The transaction data is the main input for this work, so we show the first few rows.
customer_transactions_tbl %>% arrange(tnx_timestamp) %>% head(10)
## # A tibble: 10 × 5
## customer_id cohort_qtr cohort_ym tnx_timestamp tnx_amount
## <chr> <chr> <chr> <dttm> <dbl>
## 1 C201901_0098 2019 Q1 2019 01 2019-01-01 00:02:32 140.
## 2 C201901_0066 2019 Q1 2019 01 2019-01-01 00:03:23 1390.
## 3 C201901_0039 2019 Q1 2019 01 2019-01-01 00:11:23 207.
## 4 C201901_0017 2019 Q1 2019 01 2019-01-01 00:22:12 116.
## 5 C201901_0105 2019 Q1 2019 01 2019-01-01 00:24:23 117.
## 6 C201901_0063 2019 Q1 2019 01 2019-01-01 00:34:58 428.
## 7 C201901_0014 2019 Q1 2019 01 2019-01-01 00:40:01 281.
## 8 C201901_0042 2019 Q1 2019 01 2019-01-01 00:40:36 353.
## 9 C201901_0003 2019 Q1 2019 01 2019-01-01 00:42:04 458.
## 10 C201901_0104 2019 Q1 2019 01 2019-01-01 00:53:29 220.
We also want to set up a number of parameters for use in this workbook.
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"
1.1 Construct Visualisation Data
customer_summarystats_tbl <- customer_transactions_tbl %>%
calculate_transaction_summary_stats()
customer_summarystats_tbl %>% glimpse()
## Rows: 50,000
## Columns: 9
## $ customer_id <chr> "C201901_0001", "C201901_0002", "C201901_0003", "C201901_…
## $ tnx_count <int> 6, 6, 7, 5, 10, 4, 9, 6, 4, 1, 5, 5, 4, 10, 6, 5, 6, 8, 1…
## $ first_tnx_ts <dttm> 2019-01-01 17:18:06, 2019-01-01 02:47:52, 2019-01-01 00:…
## $ last_tnx_ts <dttm> 2019-12-23 06:58:59, 2019-12-03 07:24:05, 2019-10-15 05:…
## $ btyd_count <dbl> 5, 5, 6, 4, 9, 3, 8, 5, 3, 0, 4, 4, 3, 9, 5, 4, 5, 7, 9, …
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 50.79572, 48.02740, 41.02669, 14.77388, 51.88295, 28.1065…
## $ obs_freq <dbl> 0.09843349, 0.10410723, 0.14624625, 0.27074803, 0.1734674…
## $ emp_freq <dbl> 0.09615385, 0.09615385, 0.11538462, 0.07692308, 0.1730769…
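The helper `calculate_transaction_summary_stats()` is defined elsewhere in this project, so its exact definition is not shown here. As a rough guide to what the columns above represent, a minimal sketch might look like the following; the column definitions (in particular the fixed 52-week observation window and the BTYD convention of excluding the first transaction from the repeat count) are assumptions inferred from the output, not the project's actual implementation.

```r
library(dplyr)

# Hypothetical sketch of calculate_transaction_summary_stats();
# the real helper lives elsewhere in this project.
calculate_transaction_summary_stats_sketch <- function(tnx_tbl, total_weeks = 52) {
  tnx_tbl %>%
    group_by(customer_id) %>%
    summarise(
      .groups      = "drop",
      tnx_count    = n(),
      first_tnx_ts = min(tnx_timestamp),
      last_tnx_ts  = max(tnx_timestamp),
      # BTYD conventions treat the first transaction as the start of the
      # relationship, so the repeat count excludes it
      btyd_count   = n() - 1,
      all_weeks    = total_weeks,
      # active period in weeks, from first to last observed transaction
      tnx_weeks    = as.numeric(
        difftime(last_tnx_ts, first_tnx_ts, units = "weeks")
      ),
      obs_freq     = btyd_count / tnx_weeks,
      emp_freq     = btyd_count / all_weeks
    )
}
```

This reproduces the relationships visible in the glimpse above, e.g. `obs_freq` as repeat transactions per active week and `emp_freq` as repeat transactions over the full window.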
We want to view the transactions to get a sense of how regular they are in general.
plot_tbl <- customer_transactions_tbl %>%
group_nest(customer_id, .key = "cust_data") %>%
filter(map_int(cust_data, nrow) > 3) %>%
slice_sample(n = 30) %>%
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))
1.2 Construct Data Subsets
To save time and computation, we first construct subsets of the data. Rather than drawing a different sample each time, we construct fixed sets of customer_id values to draw upon, then use those lists of values whenever we want to take a subset.
We do this by randomly shuffling the data and then selecting the top \(n\) values
of customer_id.
shuffle_tbl <- customer_summarystats_tbl %>%
slice_sample(prop = 1, replace = FALSE)
id_1000 <- shuffle_tbl %>% head(1000) %>% pull(customer_id) %>% sort()
id_5000 <- shuffle_tbl %>% head(5000) %>% pull(customer_id) %>% sort()
id_10000 <- shuffle_tbl %>% head(10000) %>% pull(customer_id) %>% sort()
We now have lists of customer_id values to use when subsetting the data.
2 Fit Initial Transaction Amount Models
We first use these ID lists to construct subsets of the transaction data.
fit_1000_data_tbl <- customer_transactions_tbl %>%
filter(customer_id %in% id_1000)
fit_1000_data_tbl %>% glimpse()
## Rows: 3,556
## Columns: 5
## $ customer_id <chr> "C201901_0025", "C201901_0084", "C201901_0084", "C201901…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 04:00:15, 2019-01-01 08:39:04, 2019-01-07 20…
## $ tnx_amount <dbl> 68.62, 69.93, 77.09, 84.40, 74.79, 83.76, 78.96, 65.58, …
fit_10000_data_tbl <- customer_transactions_tbl %>%
filter(customer_id %in% id_10000)
fit_10000_data_tbl %>% glimpse()
## Rows: 35,972
## Columns: 5
## $ customer_id <chr> "C201901_0002", "C201901_0002", "C201901_0002", "C201901…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 02:47:52, 2019-02-01 02:21:10, 2019-04-28 18…
## $ tnx_amount <dbl> 23.02, 22.42, 30.02, 28.29, 25.66, 26.07, 46.25, 39.93, …
2.1 Fit Simple Flat Stan Model
We start by fitting our first Stan model, which fits a Gamma distribution to the transaction amounts with a rate parameter \(\nu\) for each individual customer.
## data {
## real<lower=0> p; // parameter 1 for Gamma top level
## real<lower=0> q; // parameter q for Gamma nu draw
## real<lower=0> g; // parameter g for Gamma nu draw
##
## int<lower=1> n; // count of transactions
## int<lower=1> n_customer_id; // count of customers
##
## array[n] int<lower=0> customer_id; // index for customer
##
## vector<lower=0>[n] tnx_amt; // transaction amount
## }
##
##
## parameters {
## vector<lower=0>[n_customer_id] nu;
## }
##
##
## transformed parameters {
## vector<lower=0>[n_customer_id] cust_mean = (p / nu);
## vector<lower=0>[n_customer_id] cust_cv;
##
## for(i in 1:n_customer_id) {
## cust_cv[i] = sqrt(p);
## }
## }
##
## model {
## nu ~ gamma(q, g);
##
## for(i in 1:n) {
## tnx_amt[i] ~ gamma(p, nu[customer_id[i]]);
## }
## }
##
## generated quantities {
## vector[n] log_lik;
##
## for (i in 1:n) {
## log_lik[i] = gamma_lpdf(tnx_amt[i] | p, nu[customer_id[i]]);
## }
## }
We now compile this model using CmdStanR.
amtmodel_flat_stanmodel <- cmdstan_model(
"stan_code/amtmodel_flat.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then fit this compiled model to our data.
stan_modelname <- "amtmodel_flat"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
select(customer_id, tnx_amt = tnx_amount) %>%
compose_data(
p = 100,
q = 1,
g = 1
)
amtmodel_flat_stanfit <- amtmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 14.4 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 14.6 seconds.
## Chain 3 finished in 14.5 seconds.
## Chain 4 finished in 14.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 14.5 seconds.
## Total execution time: 14.8 seconds.
amtmodel_flat_stanfit$summary()
## # A tibble: 6,557 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -5.41e+5 -5.41e+5 2.22e+1 2.22e+1 -5.41e+5 -5.41e+5 1.01 642.
## 2 nu[1] 1.45e+0 1.44e+0 1.40e-1 1.42e-1 1.23e+0 1.69e+0 1.00 3190.
## 3 nu[2] 1.32e+0 1.32e+0 4.39e-2 4.42e-2 1.25e+0 1.40e+0 1.01 4330.
## 4 nu[3] 1.01e-1 1.01e-1 3.93e-3 3.79e-3 9.45e-2 1.07e-1 1.00 2308.
## 5 nu[4] 3.49e-1 3.49e-1 1.16e-2 1.12e-2 3.30e-1 3.68e-1 1.00 3094.
## 6 nu[5] 1.21e+0 1.21e+0 3.53e-2 3.45e-2 1.15e+0 1.27e+0 1.00 4167.
## 7 nu[6] 7.67e-2 7.67e-2 2.90e-3 2.87e-3 7.21e-2 8.15e-2 1.00 3199.
## 8 nu[7] 7.42e-2 7.41e-2 5.20e-3 5.23e-3 6.58e-2 8.32e-2 1.00 3989.
## 9 nu[8] 7.07e-1 7.08e-1 2.36e-2 2.37e-2 6.68e-1 7.46e-1 1.00 4385.
## 10 nu[9] 2.40e+0 2.40e+0 1.21e-1 1.22e-1 2.21e+0 2.60e+0 1.01 3755.
## # … with 6,547 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
amtmodel_flat_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_amtmodel_flat-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_flat-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_flat-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_flat-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
2.1.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"nu[1]", "nu[2]", "nu[3]", "nu[4]", "nu[5]", "nu[6]"
)
amtmodel_flat_stanfit$draws(inc_warmup = TRUE) %>%
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Posterior Values")
As the warmup is skewing the y-axis somewhat, we repeat this process without the warmup.
amtmodel_flat_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(
pars = parameter_subset
) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Posterior Values"
) +
theme(axis.text.x = element_text(size = 10))
A common MCMC diagnostic is \(\hat{R}\), a measure of the ‘similarity’ of the chains.
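As a toy illustration of how \(\hat{R}\) behaves (separate from the model fit above), we can compute it directly on simulated chains, assuming the posterior package (a dependency of CmdStanR) is available: well-mixed chains give values near 1, while a chain stuck at a different level inflates it.

```r
# Toy illustration of R-hat behaviour, using posterior::rhat() on a
# matrix of draws with iterations as rows and chains as columns.
library(posterior)

set.seed(42)
draws_good <- matrix(rnorm(4000), nrow = 1000, ncol = 4)    # 4 well-mixed chains
draws_bad  <- draws_good + rep(c(0, 0, 0, 5), each = 1000)  # shift chain 4 by 5 sd

rhat(draws_good)  # close to 1
rhat(draws_bad)   # well above the usual 1.01 threshold
```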
amtmodel_flat_stanfit %>%
rhat(pars = c("nu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")
Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
amtmodel_flat_stanfit %>%
neff_ratio(pars = c("nu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")
Finally, we also want to look at autocorrelation in the chains for each parameter.
amtmodel_flat_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Posterior Values")
In practice we do not always require a comprehensive exploration of the diagnostics, but it is good practice to run through the various visualisations when we fit a model to ensure our sample is valid.
2.1.2 Check Model Fit
We now check the parameters of this fit against the data to see how effectively the model captures it. In this case we have the benefit of knowing the ‘true’ parameter values, so we compare our model output against the simulation inputs.
amtmodel_flat_validation_tbl <- amtmodel_flat_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(nu[customer_id]) %>%
ungroup() %>%
inner_join(
customer_summarystats_tbl %>% select(customer_id, tnx_count),
by = "customer_id"
) %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, draw_id = .draw, tnx_count,
post_nu = nu, customer_p, customer_nu
)
amtmodel_flat_validation_tbl %>% glimpse()
## Rows: 2,000,000
## Columns: 6
## $ customer_id <chr> "C201901_0025", "C201901_0025", "C201901_0025", "C201901_0…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ tnx_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ post_nu <dbl> 1.41266, 1.46895, 1.47528, 1.34603, 1.53986, 1.37628, 1.58…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
## $ customer_nu <dbl> 1.546918, 1.546918, 1.546918, 1.546918, 1.546918, 1.546918…
Having constructed the validation data we now want to check the quantile of each ‘true’ value in the posterior distribution for the parameter. If our model is valid, this distribution will be uniform on \([0, 1]\).
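The helper `calculate_distribution_qvals()` is defined elsewhere in this project. As a guide to what it computes, a minimal sketch is below, under the assumption that for each customer it returns the empirical quantile of the ‘true’ value within that customer's posterior draws; the interface and column names here are inferred from how the helper is called, not its actual definition.

```r
library(dplyr)

# Hypothetical sketch of calculate_distribution_qvals(): for each group,
# q_val is the proportion of posterior draws at or below the 'true' value,
# i.e. the empirical quantile of the true value in the posterior sample.
calculate_distribution_qvals_sketch <- function(data_tbl, post_val, true_val, ...) {
  data_tbl %>%
    group_by(...) %>%
    summarise(
      .groups     = "drop",
      customer_nu = first({{ true_val }}),
      q_val       = mean({{ post_val }} <= {{ true_val }})
    )
}
```

If the model is well calibrated, these `q_val` quantiles should look uniform on \([0, 1]\) across customers.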
amtmodel_flat_nu_qvalues_tbl <- amtmodel_flat_validation_tbl %>%
calculate_distribution_qvals(post_nu, customer_nu, customer_id)
ref_value <- amtmodel_flat_nu_qvalues_tbl %>% nrow() %>% divide_by(50)
ggplot(amtmodel_flat_nu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
ggplot(amtmodel_flat_nu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_nu)) +
labs(
x = "Quantile",
y = "Customer nu",
title = "Scatterplot of q-Value against nu"
)
In real-world examples we will not have any “true” parameter values, so we also want to look at the distribution of \(q\)-values against the derived ECDFs:
amtmodel_flat_amt_qvals_tbl <- fit_1000_data_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
tnx_vals = list(tnx_amount)
) %>%
inner_join(
amtmodel_flat_validation_tbl %>% select(customer_id, customer_p, post_nu),
by = "customer_id"
) %>%
mutate(
q_vals = pmap(
list(x1 = tnx_vals, x2 = customer_p, x3 = post_nu),
function(x1, x2, x3) pgamma(q = x1, shape = x2, rate = x3)
)
) %>%
select(customer_id, customer_p, post_nu, q_vals) %>%
unnest(q_vals)
amtmodel_flat_amt_qvals_tbl %>% glimpse()
## Rows: 7,112,000
## Columns: 4
## $ customer_id <chr> "C201901_0025", "C201901_0025", "C201901_0025", "C201901_0…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
## $ post_nu <dbl> 1.41266, 1.46895, 1.47528, 1.34603, 1.53986, 1.37628, 1.58…
## $ q_vals <dbl> 0.39124142, 0.54500141, 0.56205240, 0.22652759, 0.72218614…
2.2 Fit Flat Stan Model with Reparameterisation
We keep our existing model but now re-parameterise the \(\nu\) Gamma distribution so that we can specify both the mean and the coefficient of variation directly.
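The re-parameterisation maps \((\text{mean}, \text{cv})\) to the usual (shape, rate) pair via \(q = 1/\text{cv}^2\) and \(g = 1/(\text{cv}^2 \cdot \text{mean})\). We can sanity-check this mapping in R using the analytic moments of the Gamma distribution (mean \(= q/g\), cv \(= 1/\sqrt{q}\)):

```r
# Sanity check of the (mean, cv) -> (shape, rate) mapping:
#   shape q = 1 / cv^2
#   rate  g = 1 / (cv^2 * mean)
nu_mean <- 2.5
nu_cv   <- 0.4

q <- 1 / (nu_cv^2)            # shape
g <- 1 / (nu_cv^2 * nu_mean)  # rate

# Analytic moments of Gamma(q, g): mean = q / g, cv = 1 / sqrt(q)
stopifnot(
  isTRUE(all.equal(q / g, nu_mean)),
  isTRUE(all.equal(1 / sqrt(q), nu_cv))
)
```

This confirms the formulas used in the Stan model's transformed data block recover the requested mean and coefficient of variation.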
## data {
## real<lower=0> p; // parameter 1 for Gamma top level
## real<lower=0> nu_mean; // parameter for mean of Gamma nu draw
## real<lower=0> nu_cv; // parameter for cov of Gamma nu draw
##
## int<lower=1> n; // count of transactions
## int<lower=1> n_customer_id; // count of customers
##
## array[n] int<lower=0> customer_id; // index for customer
##
## vector<lower=0>[n] tnx_amt; // transaction amount
## }
##
##
## transformed data {
## real<lower=0> q = 1 / (nu_cv * nu_cv);
## real<lower=0> g = 1 / (nu_cv * nu_cv * nu_mean);
## }
##
##
## parameters {
## vector<lower=0>[n_customer_id] nu;
## }
##
##
## transformed parameters {
## vector<lower=0>[n_customer_id] cust_mean = (p / nu);
## vector<lower=0>[n_customer_id] cust_cv;
##
## for(i in 1:n_customer_id) {
## cust_cv[i] = sqrt(p);
## }
## }
##
##
## model {
## nu ~ gamma(q, g);
##
## for(i in 1:n) {
## tnx_amt[i] ~ gamma(p, nu[customer_id[i]]);
## }
## }
##
## generated quantities {
## vector[n] log_lik;
##
## for (i in 1:n) {
## log_lik[i] = gamma_lpdf(tnx_amt[i] | p, nu[customer_id[i]]);
## }
## }
We now compile this model using CmdStanR.
amtmodel_flat_reparam_stanmodel <- cmdstan_model(
"stan_code/amtmodel_flat_reparam.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then fit this compiled model to our data.
stan_modelname <- "amtmodel_flat_reparam"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
select(customer_id, tnx_amt = tnx_amount) %>%
compose_data(
p = 100,
nu_mean = 1,
nu_cv = 1
)
amtmodel_flat_reparam_stanfit <- amtmodel_flat_reparam_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 13.1 seconds.
## Chain 1 finished in 13.5 seconds.
## Chain 3 finished in 13.6 seconds.
## Chain 4 finished in 13.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 13.4 seconds.
## Total execution time: 14.4 seconds.
amtmodel_flat_reparam_stanfit$summary()
## # A tibble: 6,557 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -5.41e+5 -5.41e+5 2.22e+1 2.22e+1 -5.41e+5 -5.41e+5 1.01 642.
## 2 nu[1] 1.45e+0 1.44e+0 1.40e-1 1.42e-1 1.23e+0 1.69e+0 1.00 3190.
## 3 nu[2] 1.32e+0 1.32e+0 4.39e-2 4.42e-2 1.25e+0 1.40e+0 1.01 4330.
## 4 nu[3] 1.01e-1 1.01e-1 3.93e-3 3.79e-3 9.45e-2 1.07e-1 1.00 2308.
## 5 nu[4] 3.49e-1 3.49e-1 1.16e-2 1.12e-2 3.30e-1 3.68e-1 1.00 3094.
## 6 nu[5] 1.21e+0 1.21e+0 3.53e-2 3.45e-2 1.15e+0 1.27e+0 1.00 4167.
## 7 nu[6] 7.67e-2 7.67e-2 2.90e-3 2.87e-3 7.21e-2 8.15e-2 1.00 3199.
## 8 nu[7] 7.42e-2 7.41e-2 5.20e-3 5.23e-3 6.58e-2 8.32e-2 1.00 3989.
## 9 nu[8] 7.07e-1 7.08e-1 2.36e-2 2.37e-2 6.68e-1 7.46e-1 1.00 4385.
## 10 nu[9] 2.40e+0 2.40e+0 1.21e-1 1.22e-1 2.21e+0 2.60e+0 1.01 3755.
## # … with 6,547 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
amtmodel_flat_reparam_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_amtmodel_flat_reparam-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_flat_reparam-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_flat_reparam-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_flat_reparam-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
2.2.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"nu[1]", "nu[2]", "nu[3]", "nu[4]", "nu[5]", "nu[6]"
)
amtmodel_flat_reparam_stanfit$draws(inc_warmup = TRUE) %>%
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Posterior Values")
As the warmup is skewing the y-axis somewhat, we repeat this process without the warmup.
amtmodel_flat_reparam_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(
pars = parameter_subset
) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Posterior Values"
) +
theme(axis.text.x = element_text(size = 10))
A common MCMC diagnostic is \(\hat{R}\), a measure of the ‘similarity’ of the chains.
amtmodel_flat_reparam_stanfit %>%
rhat(pars = c("nu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")
Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
amtmodel_flat_reparam_stanfit %>%
neff_ratio(pars = c("nu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")
Finally, we also want to look at autocorrelation in the chains for each parameter.
amtmodel_flat_reparam_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Posterior Values")
In practice we do not always require a comprehensive exploration of the diagnostics, but it is good practice to run through the various visualisations when we fit a model to ensure our sample is valid.
2.2.2 Check Model Fit
We now check the parameters of this fit against the data to see how effectively the model captures it. In this case we have the benefit of knowing the ‘true’ parameter values, so we compare our model output against the simulation inputs.
amtmodel_flat_reparam_validation_tbl <- amtmodel_flat_reparam_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(nu[customer_id], cust_mean[customer_id], cust_cv[customer_id]) %>%
ungroup() %>%
inner_join(
customer_summarystats_tbl %>% select(customer_id, tnx_count),
by = "customer_id"
) %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, draw_id = .draw, tnx_count,
post_nu = nu, customer_p, customer_nu,
post_cust_mean = cust_mean, post_cust_cv = cust_cv
)
amtmodel_flat_reparam_validation_tbl %>% glimpse()
## Rows: 2,000,000
## Columns: 8
## $ customer_id <chr> "C201901_0025", "C201901_0025", "C201901_0025", "C20190…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ tnx_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ post_nu <dbl> 1.41266, 1.46895, 1.47528, 1.34603, 1.53986, 1.37628, 1…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ customer_nu <dbl> 1.546918, 1.546918, 1.546918, 1.546918, 1.546918, 1.546…
## $ post_cust_mean <dbl> 70.7883, 68.0757, 67.7839, 74.2925, 64.9408, 72.6597, 6…
## $ post_cust_cv <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
Having constructed the validation data we now want to check the quantile of each ‘true’ value in the posterior distribution for the parameter. If our model is valid, this distribution will be uniform on \([0, 1]\).
amtmodel_flat_reparam_nu_qvalues_tbl <- amtmodel_flat_reparam_validation_tbl %>%
calculate_distribution_qvals(post_nu, customer_nu, customer_id)
ref_value <- amtmodel_flat_reparam_nu_qvalues_tbl %>% nrow() %>% divide_by(50)
ggplot(amtmodel_flat_reparam_nu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
ggplot(amtmodel_flat_reparam_nu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_nu)) +
labs(
x = "Quantile",
y = "Customer nu",
title = "Scatterplot of q-Value against nu"
)
In real-world examples we will not have any “true” parameter values, so we also want to look at the distribution of \(q\)-values against the derived ECDFs:
amtmodel_flat_reparam_amt_qvals_tbl <- fit_1000_data_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
tnx_vals = list(tnx_amount)
) %>%
inner_join(
amtmodel_flat_reparam_validation_tbl %>% select(customer_id, customer_p, post_nu),
by = "customer_id"
) %>%
mutate(
q_vals = pmap(
list(x1 = tnx_vals, x2 = customer_p, x3 = post_nu),
function(x1, x2, x3) pgamma(q = x1, shape = x2, rate = x3)
)
) %>%
select(customer_id, customer_p, post_nu, q_vals) %>%
unnest(q_vals)
amtmodel_flat_reparam_amt_qvals_tbl %>% glimpse()
## Rows: 7,112,000
## Columns: 4
## $ customer_id <chr> "C201901_0025", "C201901_0025", "C201901_0025", "C201901_0…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
## $ post_nu <dbl> 1.41266, 1.46895, 1.47528, 1.34603, 1.53986, 1.37628, 1.58…
## $ q_vals <dbl> 0.39124142, 0.54500141, 0.56205240, 0.22652759, 0.72218614…
3 Adding Hierarchical Priors
We now want to add extra Stan models that allow us to incorporate some uncertainty in the value of \(p\).
3.1 Fit Hierarchical p Stan Model with Reparameterisation
We now add a Gamma prior to our value of \(p\).
## data {
## real<lower=0> p_mean; // parameter 1 for Gamma hierarchy for p
## real<lower=0> p_cv; // parameter 2 for Gamma hierarchy for p
##
## real<lower=0> nu_mean; // parameter for mean of Gamma nu draw
## real<lower=0> nu_cv; // parameter for cov of Gamma nu draw
##
## int<lower=1> n; // count of transactions
## int<lower=1> n_customer_id; // count of customers
##
## array[n] int<lower=0> customer_id; // index for customer
##
## vector<lower=0>[n] tnx_amt; // transaction amount
## }
##
## transformed data {
## // shape <- 1 / (cv^2)
## // rate <- 1 / mu * cv^2
##
## real<lower=0> p_shape = 1 / (p_cv * p_cv);
## real<lower=0> p_rate = 1 / (p_cv * p_cv * p_mean);
##
## real<lower=0> q = 1 / (nu_cv * nu_cv);
## real<lower=0> g = 1 / (nu_cv * nu_cv * nu_mean);
## }
##
##
## parameters {
## real<lower=0> p;
##
## vector<lower=0>[n_customer_id] nu;
## }
##
##
## transformed parameters {
## vector<lower=0>[n_customer_id] cust_mean = (p / nu);
## vector<lower=0>[n_customer_id] cust_cv;
##
## for(i in 1:n_customer_id) {
## cust_cv[i] = sqrt(p);
## }
## }
##
##
## model {
## p ~ gamma(p_shape, p_rate);
## nu ~ gamma(q, g);
##
## for(i in 1:n) {
## tnx_amt[i] ~ gamma(p, nu[customer_id[i]]);
## }
## }
##
## generated quantities {
## vector[n] log_lik;
##
## for (i in 1:n) {
## log_lik[i] = gamma_lpdf(tnx_amt[i] | p, nu[customer_id[i]]);
## }
## }
We now compile this model using CmdStanR.
amtmodel_hier_p_stanmodel <- cmdstan_model(
"stan_code/amtmodel_hier_p.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then fit this compiled model to our data.
stan_modelname <- "amtmodel_hier_p"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
select(customer_id, tnx_amt = tnx_amount) %>%
compose_data(
p_mean = 100,
p_cv = 1,
nu_mean = 1,
nu_cv = 1
)
amtmodel_hier_p_stanfit <- amtmodel_hier_p_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4202,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 38.0 seconds.
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 38.9 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 39.7 seconds.
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 44.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 40.3 seconds.
## Total execution time: 45.4 seconds.
amtmodel_hier_p_stanfit$summary()## # A tibble: 6,558 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -1.66e+4 -1.66e+4 2.41e+1 2.34e+1 -1.67e+4 -1.66e+4 1.03 139.
## 2 p 9.87e+1 9.87e+1 1.86e+0 1.86e+0 9.57e+1 1.02e+2 1.17 17.9
## 3 nu[1] 1.43e+0 1.42e+0 1.43e-1 1.43e-1 1.21e+0 1.67e+0 1.01 1258.
## 4 nu[2] 1.31e+0 1.30e+0 4.97e-2 4.97e-2 1.23e+0 1.39e+0 1.03 94.3
## 5 nu[3] 9.96e-2 9.95e-2 4.20e-3 4.04e-3 9.28e-2 1.07e-1 1.03 95.3
## 6 nu[4] 3.44e-1 3.44e-1 1.26e-2 1.24e-2 3.24e-1 3.65e-1 1.04 89.0
## 7 nu[5] 1.19e+0 1.19e+0 4.10e-2 4.21e-2 1.13e+0 1.26e+0 1.04 63.7
## 8 nu[6] 7.57e-2 7.57e-2 3.23e-3 3.24e-3 7.05e-2 8.12e-2 1.02 148.
## 9 nu[7] 7.33e-2 7.30e-2 5.43e-3 5.38e-3 6.46e-2 8.24e-2 1.00 871.
## 10 nu[8] 6.99e-1 6.98e-1 2.57e-2 2.53e-2 6.57e-1 7.43e-1 1.04 77.3
## # … with 6,548 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
amtmodel_hier_p_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## The following parameters had split R-hat greater than 1.05:
## p, cust_cv[1], cust_cv[2], cust_cv[3], ..., cust_cv[999], cust_cv[1000]
## Such high values indicate incomplete mixing and biased estimation.
## You should consider regularizating your model with additional prior information or a more effective parameterization.
##
## Processing complete.
There is an issue with the sampling of the \(p\) value: the diagnostics suggest the chains have not mixed properly. We diagnose this by looking at the traceplots and other validation metrics.
3.1.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution, we create a few visualisations of the diagnostics.
parameter_subset <- c(
"p", "nu[1]", "nu[2]", "nu[3]", "nu[4]", "nu[5]"
)
amtmodel_hier_p_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(
pars = parameter_subset
) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Posterior Values"
) +
theme(axis.text.x = element_text(size = 10))A common MCMC diagnostic is \(\hat{R}\), a measure of the ‘similarity’ of the chains.
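bayesplot reports the split-\(\hat{R}\) statistic: roughly, each chain is split in half and the between-half variance is compared to the within-half variance. A bare-bones sketch of that computation (without the rank-normalisation modern implementations add):

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat: halve each chain, then compare between-half and
    within-half variances; values near 1 indicate good mixing."""
    halves = []
    for chain in chains:
        chain = np.asarray(chain, dtype=float)
        half = len(chain) // 2
        halves += [chain[:half], chain[half:2 * half]]
    x = np.stack(halves)                    # (2 * n_chains, half_length)
    n = x.shape[1]
    between = n * x.mean(axis=1).var(ddof=1)
    within = x.var(axis=1, ddof=1).mean()
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(42)
mixed = [rng.normal(0.0, 1.0, 500) for _ in range(4)]
stuck = [rng.normal(mu, 1.0, 500) for mu in (0.0, 0.0, 0.0, 3.0)]

assert split_rhat(mixed) < 1.01   # well-mixed chains
assert split_rhat(stuck) > 1.1    # one chain exploring a different region
```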
amtmodel_hier_p_stanfit %>%
rhat(pars = c("p", "nu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the number of effectively independent draws in the sample.
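\(N_{eff}\) discounts the raw draw count by the autocorrelation in the chain. A crude sketch of the idea (a simplified estimator, not the one Stan uses):

```python
import numpy as np

def ess(x, max_lag=200):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first negative autocorrelation estimate."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    total = 0.0
    for lag in range(1, min(max_lag, n)):
        if rho[lag] < 0:
            break
        total += rho[lag]
    return n / (1.0 + 2.0 * total)

rng = np.random.default_rng(0)
iid = rng.normal(size=4000)
ar1 = np.empty(4000)                # AR(1) draws with correlation 0.9
ar1[0] = rng.normal()
for i in range(1, 4000):
    ar1[i] = 0.9 * ar1[i - 1] + rng.normal()

assert ess(iid) > 2000   # nearly independent draws
assert ess(ar1) < 1000   # autocorrelation sharply reduces N_eff
```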
amtmodel_hier_p_stanfit %>%
neff_ratio(pars = c("p", "nu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")It is possible our prior on \(p\) is too wide, so we check this by creating a few plots using the utility function rgamma_mucv.
ggplot() +
geom_histogram(aes(x = rgamma_mucv(n = 10000, mu = 100, cv = 1)), bins = 50) +
labs(
x = "Sample Draw",
y = "Frequency",
title = "Histogram of Draws from Gamma with mu = 100, cv = 1"
)This gives a very wide spread of values, so we reduce the dispersion in the prior by lowering the coefficient of variation to 0.1.
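rgamma_mucv is presumably a thin wrapper that maps the (mean, CV) pair onto shape/rate before drawing gamma variates. A sketch of that assumed behaviour, showing how strongly the CV controls the spread:

```python
import numpy as np

# Assumed behaviour of rgamma_mucv(): draw from a gamma distribution
# parameterised by its mean and coefficient of variation.
def rgamma_mucv(n, mu, cv, rng):
    shape = 1.0 / cv**2
    scale = cv**2 * mu           # numpy parameterises by scale = 1 / rate
    return rng.gamma(shape, scale, size=n)

rng = np.random.default_rng(42)
wide = rgamma_mucv(10_000, mu=100, cv=1.0, rng=rng)
tight = rgamma_mucv(10_000, mu=100, cv=0.1, rng=rng)

assert abs(tight.mean() - 100) < 1        # both priors centre on 100...
assert wide.std() > 5 * tight.std()       # ...but cv = 1 is far more dispersed
```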
ggplot() +
geom_histogram(aes(x = rgamma_mucv(n = 10000, mu = 100, cv = 0.1)), bins = 50) +
labs(
x = "Sample Draw",
y = "Frequency",
title = "Histogram of Draws from Gamma with mu = 100, cv = 0.1"
)3.2 Fit Hierarchical p Stan Model with Tighter Priors
We keep our existing model but refit it with the tighter prior on the \(p\) parameter.
stan_modelname <- "amtmodel_hier_p_tight"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
select(customer_id, tnx_amt = tnx_amount) %>%
compose_data(
p_mean = 100,
p_cv = 0.1,
nu_mean = 1,
nu_cv = 1
)
amtmodel_hier_p_stanfit <- amtmodel_hier_p_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4202,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 38.0 seconds.
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 finished in 47.8 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 49.6 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 51.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 46.7 seconds.
## Total execution time: 51.7 seconds.
amtmodel_hier_p_stanfit$summary()## # A tibble: 6,558 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -1.63e+4 -1.63e+4 2.40e+1 2.33e+1 -1.63e+4 -1.62e+4 1.01 348.
## 2 p 9.92e+1 9.92e+1 1.52e+0 1.53e+0 9.66e+1 1.02e+2 1.04 56.9
## 3 nu[1] 1.44e+0 1.44e+0 1.47e-1 1.40e-1 1.21e+0 1.69e+0 1.01 2058.
## 4 nu[2] 1.31e+0 1.31e+0 4.91e-2 4.95e-2 1.23e+0 1.39e+0 1.01 330.
## 5 nu[3] 9.99e-2 9.98e-2 4.22e-3 4.26e-3 9.28e-2 1.07e-1 1.00 527.
## 6 nu[4] 3.46e-1 3.46e-1 1.18e-2 1.19e-2 3.27e-1 3.65e-1 1.01 280.
## 7 nu[5] 1.20e+0 1.20e+0 3.98e-2 3.88e-2 1.14e+0 1.27e+0 1.01 299.
## 8 nu[6] 7.62e-2 7.61e-2 3.16e-3 3.25e-3 7.12e-2 8.15e-2 1.01 454.
## 9 nu[7] 7.36e-2 7.36e-2 5.30e-3 5.26e-3 6.49e-2 8.25e-2 1.01 2108.
## 10 nu[8] 7.01e-1 7.01e-1 2.57e-2 2.53e-2 6.59e-1 7.44e-1 1.00 405.
## # … with 6,548 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
amtmodel_hier_p_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p_tight-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p_tight-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p_tight-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_amtmodel_hier_p_tight-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
3.2.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution, we create a few visualisations of the diagnostics.
amtmodel_hier_p_stanfit$draws(inc_warmup = TRUE) %>%
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Posterior Values")As the warmup draws skew the y-axis somewhat, we repeat the plot without them.
amtmodel_hier_p_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(
pars = parameter_subset
) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Posterior Values"
) +
theme(axis.text.x = element_text(size = 10))A common MCMC diagnostic is \(\hat{R}\), a measure of the ‘similarity’ of the chains.
amtmodel_hier_p_stanfit %>%
rhat(pars = c("p", "nu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the number of effectively independent draws in the sample.
amtmodel_hier_p_stanfit %>%
neff_ratio(pars = c("p", "nu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")Finally, we also want to look at autocorrelation in the chains for each parameter.
amtmodel_hier_p_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Posterior Values")In practice we do not always require a comprehensive exploration of the diagnostics, but it is good practice to run through the various visualisations when we fit a model to ensure our sample is valid.
3.3 Check Model Fit
We now need to check the parameters of this fit against the data to see how well our model captures it. In this case we have the benefit of knowing the ‘true’ parameter values, so we compare our model output against the simulation inputs.
amtmodel_hier_p_validation_tbl <- amtmodel_hier_p_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(nu[customer_id]) %>%
ungroup() %>%
inner_join(
customer_summarystats_tbl %>% select(customer_id, tnx_count),
by = "customer_id"
) %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(
customer_id, draw_id = .draw, tnx_count,
post_nu = nu, customer_p, customer_nu
)
amtmodel_hier_p_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 6
## $ customer_id <chr> "C201901_0025", "C201901_0025", "C201901_0025", "C201901_0…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ tnx_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ post_nu <dbl> 1.52051, 1.39534, 1.56869, 1.42026, 1.40443, 1.39675, 1.47…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
## $ customer_nu <dbl> 1.546918, 1.546918, 1.546918, 1.546918, 1.546918, 1.546918…
Having constructed the validation data we now want to check the quantile of each ‘true’ value in the posterior distribution for the parameter. If our model is valid, this distribution will be uniform on \([0, 1]\).
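The logic of this calibration check can be sketched in isolation: for each customer we locate the true value within its posterior draws, and good calibration means those quantiles look uniform. A minimal illustration with simulated, exactly-calibrated draws (Python sketch; `qvalue` is a stand-in for the workbook's calculate_distribution_qvals):

```python
import numpy as np

rng = np.random.default_rng(7)

# q-value: the quantile of the true value within its posterior draws.
def qvalue(draws, truth):
    return float(np.mean(np.asarray(draws) < truth))

# Exactly-calibrated toy setting: the "posterior" draws come from the
# same distribution as the truth, so the q-values should look uniform.
q_vals = []
for _ in range(2000):
    truth = rng.normal()
    draws = rng.normal(size=500)
    q_vals.append(qvalue(draws, truth))
q_vals = np.array(q_vals)

assert abs(q_vals.mean() - 0.5) < 0.05          # centred on 0.5
assert 0.4 < np.mean(q_vals < 0.5) < 0.6        # even mass in each half
```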
amtmodel_hier_p_nu_qvalues_tbl <- amtmodel_hier_p_validation_tbl %>%
calculate_distribution_qvals(post_nu, customer_nu, customer_id)
ref_value <- amtmodel_hier_p_nu_qvalues_tbl %>% nrow() %>% divide_by(50)
ggplot(amtmodel_hier_p_nu_qvalues_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = ref_value), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(amtmodel_hier_p_nu_qvalues_tbl) +
geom_point(aes(x = q_val, y = customer_nu)) +
labs(
x = "Quantile",
y = "Customer nu",
title = "Scatterplot of q-Value against nu"
)In real-world examples, however, we will not have any “true” parameter values, so we also want to look at the distribution of \(q\)-values against the derived ECDFs:
amtmodel_hier_p_amt_qvals_tbl <- fit_1000_data_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
tnx_vals = list(tnx_amount)
) %>%
inner_join(
amtmodel_hier_p_validation_tbl %>% select(customer_id, customer_p, post_nu),
by = "customer_id"
) %>%
mutate(
q_vals = pmap(
list(x1 = tnx_vals, x2 = customer_p, x3 = post_nu),
function(x1, x2, x3) pgamma(q = x1, shape = x2, rate = x3)
)
) %>%
select(customer_id, customer_p, post_nu, q_vals) %>%
unnest(q_vals)
amtmodel_hier_p_amt_qvals_tbl %>% glimpse()## Rows: 7,112,000
## Columns: 4
## $ customer_id <chr> "C201901_0025", "C201901_0025", "C201901_0025", "C201901_0…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
## $ post_nu <dbl> 1.52051, 1.39534, 1.56869, 1.42026, 1.40443, 1.39675, 1.47…
## $ q_vals <dbl> 0.67754517, 0.34535346, 0.78196159, 0.41178741, 0.36924979…
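The pgamma() transform above is a posterior-predictive PIT check: each observed amount is pushed through the fitted gamma CDF, and a well-fitting model yields roughly uniform q-values. A dependency-light sketch of the same idea, approximating the gamma CDF with the ECDF of a large reference sample (pgamma() evaluates it exactly):

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy version of the pgamma() check: push observed amounts through the
# (here: true) gamma CDF and inspect the q-values for uniformity. We
# stand in for the exact CDF with the ECDF of a large reference sample.
p_true, nu_true = 100.0, 1.5
amounts = rng.gamma(p_true, 1.0 / nu_true, size=3000)
reference = np.sort(rng.gamma(p_true, 1.0 / nu_true, size=200_000))
q_vals = np.searchsorted(reference, amounts) / len(reference)

assert abs(q_vals.mean() - 0.5) < 0.03
assert abs(np.quantile(q_vals, 0.9) - 0.9) < 0.03
```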
4 R Environment
options(width = 120L)
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-08-03
## pandoc 2.17.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.2.0)
## arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.2.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.2.0)
## backports 1.4.1 2021-12-13 [1] RSPM (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.2.0)
## bayesplot * 1.9.0 2022-03-10 [1] RSPM (R 4.2.0)
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.2.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.2.0)
## bookdown 0.27 2022-06-14 [1] RSPM (R 4.2.0)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.2.0)
## bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.2.0)
## brms * 2.17.0 2022-08-03 [1] Github (paul-buerkner/brms@a43937c)
## Brobdingnag 1.2-7 2022-02-03 [1] RSPM (R 4.2.0)
## broom 0.8.0 2022-04-13 [1] RSPM (R 4.2.0)
## bslib 0.3.1 2021-10-06 [1] RSPM (R 4.2.0)
## cachem 1.0.6 2021-08-19 [1] RSPM (R 4.2.0)
## callr 3.7.0 2021-04-20 [1] RSPM (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.2.0)
## checkmate 2.1.0 2022-04-21 [1] RSPM (R 4.2.0)
## cli 3.3.0 2022-04-25 [1] RSPM (R 4.2.0)
## cmdstanr * 0.5.3 2022-08-03 [1] Github (stan-dev/cmdstanr@22b391e)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.2.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [1] RSPM (R 4.2.0)
## colourpicker 1.1.1 2021-10-04 [1] RSPM (R 4.2.0)
## conflicted * 1.1.0 2021-11-26 [1] RSPM (R 4.2.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.2.0)
## crayon 1.5.1 2022-03-26 [1] RSPM (R 4.2.0)
## crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.2.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.2.0)
## DBI 1.1.3 2022-06-18 [1] RSPM (R 4.2.0)
## dbplyr 2.2.0 2022-06-05 [1] RSPM (R 4.2.0)
## digest 0.6.29 2021-12-01 [1] RSPM (R 4.2.0)
## directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.2.0)
## distributional 0.3.0 2022-01-05 [1] RSPM (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] RSPM (R 4.2.0)
## DT 0.23 2022-05-10 [1] RSPM (R 4.2.0)
## dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.2.0)
## evaluate 0.15 2022-02-18 [1] RSPM (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] RSPM (R 4.2.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.2.0)
## fs * 1.5.2 2021-12-08 [1] RSPM (R 4.2.0)
## furrr * 0.3.0 2022-05-04 [1] RSPM (R 4.2.0)
## future * 1.26.1 2022-05-27 [1] RSPM (R 4.2.0)
## gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.2.0)
## generics 0.1.2 2022-01-31 [1] RSPM (R 4.2.0)
## ggdist 3.1.1 2022-02-27 [1] RSPM (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] RSPM (R 4.2.0)
## ggridges 0.5.3 2021-01-08 [1] RSPM (R 4.2.0)
## globals 0.15.0 2022-05-09 [1] RSPM (R 4.2.0)
## glue * 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.2.0)
## gtools 3.9.2.2 2022-06-13 [1] RSPM (R 4.2.0)
## haven 2.5.0 2022-04-15 [1] RSPM (R 4.2.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.2.0)
## hms 1.1.1 2021-09-26 [1] RSPM (R 4.2.0)
## htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.2.0)
## htmlwidgets 1.5.4 2021-09-08 [1] RSPM (R 4.2.0)
## httpuv 1.6.5 2022-01-05 [1] RSPM (R 4.2.0)
## httr 1.4.3 2022-05-04 [1] RSPM (R 4.2.0)
## igraph 1.3.2 2022-06-13 [1] RSPM (R 4.2.0)
## inline 0.3.19 2021-05-31 [1] RSPM (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [1] RSPM (R 4.2.0)
## knitr 1.39 2022-04-26 [1] RSPM (R 4.2.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
## later 1.3.0 2021-08-18 [1] RSPM (R 4.2.0)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.2.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.2.0)
## lme4 1.1-29 2022-04-07 [1] RSPM (R 4.2.0)
## loo 2.5.1 2022-03-24 [1] RSPM (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [1] RSPM (R 4.2.0)
## magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.2.0)
## markdown 1.1 2019-08-07 [1] RSPM (R 4.2.0)
## MASS 7.3-56 2022-03-23 [2] CRAN (R 4.2.0)
## Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.0)
## matrixStats 0.62.0 2022-04-19 [1] RSPM (R 4.2.0)
## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.2.0)
## mgcv 1.8-40 2022-03-29 [2] CRAN (R 4.2.0)
## mime 0.12 2021-09-28 [1] RSPM (R 4.2.0)
## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.2.0)
## minqa 1.2.4 2014-10-09 [1] RSPM (R 4.2.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
## mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.2.0)
## nlme 3.1-157 2022-03-25 [2] CRAN (R 4.2.0)
## nloptr 2.0.3 2022-05-26 [1] RSPM (R 4.2.0)
## parallelly 1.32.0 2022-06-07 [1] RSPM (R 4.2.0)
## pillar 1.7.0 2022-02-01 [1] RSPM (R 4.2.0)
## pkgbuild 1.3.1 2021-12-20 [1] RSPM (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.2.0)
## plyr 1.8.7 2022-03-24 [1] RSPM (R 4.2.0)
## posterior * 1.2.2 2022-06-09 [1] RSPM (R 4.2.0)
## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.2.0)
## processx 3.6.1 2022-06-17 [1] RSPM (R 4.2.0)
## projpred 2.1.2 2022-05-13 [1] RSPM (R 4.2.0)
## promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.2.0)
## ps 1.7.1 2022-06-18 [1] RSPM (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.2.0)
## quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] RSPM (R 4.2.0)
## Rcpp * 1.0.8.3 2022-03-17 [1] RSPM (R 4.2.0)
## RcppParallel 5.1.5 2022-01-05 [1] RSPM (R 4.2.0)
## readr * 2.1.2 2022-01-30 [1] RSPM (R 4.2.0)
## readxl 1.4.0 2022-03-28 [1] RSPM (R 4.2.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.2.0)
## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.2.0)
## rlang * 1.0.2 2022-03-04 [1] RSPM (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [1] RSPM (R 4.2.0)
## rmdformats 1.0.4 2022-05-17 [1] RSPM (R 4.2.0)
## rstan 2.26.13 2022-08-03 [1] local
## rstantools 2.2.0 2022-04-08 [1] RSPM (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.2.0)
## rvest 1.0.2 2021-10-16 [1] RSPM (R 4.2.0)
## sass 0.4.1 2022-03-23 [1] RSPM (R 4.2.0)
## scales * 1.2.0 2022-04-13 [1] RSPM (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
## shiny 1.7.1 2021-10-02 [1] RSPM (R 4.2.0)
## shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.2.0)
## shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.2.0)
## shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.2.0)
## StanHeaders 2.26.13 2022-08-03 [1] local
## stringi 1.7.6 2021-11-29 [1] RSPM (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.2.0)
## svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.2.0)
## tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.2.0)
## threejs 0.3.3 2020-01-21 [1] RSPM (R 4.2.0)
## tibble * 3.1.7 2022-05-03 [1] RSPM (R 4.2.0)
## tidybayes * 3.0.2 2022-01-05 [1] RSPM (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [1] RSPM (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [1] RSPM (R 4.2.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.2.0)
## V8 4.2.0 2022-05-14 [1] RSPM (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [1] RSPM (R 4.2.0)
## vroom 1.5.7 2021-11-30 [1] RSPM (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] RSPM (R 4.2.0)
## xfun 0.31 2022-05-10 [1] RSPM (R 4.2.0)
## xml2 1.3.3 2021-11-30 [1] RSPM (R 4.2.0)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.2.0)
## xts 0.12.1 2020-09-09 [1] RSPM (R 4.2.0)
## yaml 2.3.5 2022-02-21 [1] RSPM (R 4.2.0)
## zoo 1.8-10 2022-04-15 [1] RSPM (R 4.2.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
options(width = 80L)
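As an aside, rather than widening the global output width and resetting it afterwards, the change could be scoped to the single call. A minimal sketch using `withr::with_options()` (the withr package appears in the list above; its use here is our suggestion, not part of the original workflow):

```r
# Temporarily widen the console output for just this call;
# the global width option is restored automatically afterwards.
withr::with_options(
  list(width = 120L),
  sessioninfo::session_info()
)
```

This avoids the paired `options(width = 120L)` / `options(width = 80L)` calls and keeps the session state unchanged even if the wrapped call errors.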